This project was done as a part of “Data Analysis” course as part of our teams studies in Digital Sciences for High-Tech in the University of Tel-Aviv. Our team has great interest in using our studies for exploring and in the future maybe even developing tools for improving the way we treat our planet. Therefore our subject is CO2 emissions in congestion with the world happiness index.
happy_index_2005 <- "Data/world-happiness-report-2005-2020.csv"
happy_index_2021 <- "Data/world-happiness-report-2021.csv"
emission <- "Data/co2_emission.csv"
population <-"Data/population.csv"
df_2005 <- read.csv(happy_index_2005)
df_2021 <- read.csv(happy_index_2021)
df_emiss <- read.csv(emission, col.names = c("Entity", "Code" , "Year", "co2"))
df_pop <- read.csv(population, col.names = c("Entity", "Code" , "Year", "Population"))
The first Data consists of the UN happiness index for the years 2005-2017.
as_tibble(df_2005)
The second data is the happiness index report for 2021.
as_tibble(df_2021)
The third data contains CO2 emissions from around 1700 up until 2017, by countries and continents.
as_tibble(df_emiss)
The final data is a list of population sizes by countries for about the same years as the CO2 data.
as_tibble(df_pop)
As seen in the table above, the main dataset regarding emissions has only 4 features, of which 2 are identical (countries and country code). We will now examine the summary of it’s characteristics:
summary(df_emiss)
Entity Code Year co2
Length:20853 Length:20853 Min. :1751 Min. :-6.255e+08
Class :character Class :character 1st Qu.:1932 1st Qu.: 3.188e+05
Mode :character Mode :character Median :1971 Median : 3.829e+06
Mean :1953 Mean : 1.931e+08
3rd Qu.:1995 3rd Qu.: 3.707e+07
Max. :2017 Max. : 3.615e+10
As seen above, we can see the minimal year, minimal amount of CO2 emissions. Also shown are mean and max values. We also learn about the size of the data, around ~21k rows.
[1] "ï..Country.name" "year" "Life.Ladder"
[4] "Log.GDP.per.capita" "Social.support" "Healthy.life.expectancy.at.birth"
[7] "Freedom.to.make.life.choices" "Generosity" "Perceptions.of.corruption"
[10] "Positive.affect" "Negative.affect"
Joining, by = "Entity"
We have just created the two main datasets to work with, which includes important features from all of the previous datasets and some features that we have created above:
We also made a choice to limit our data regarding emissions to after 1950. The reason is that the rise in values is almost exponential after the years of WW2. When plotted on a graph, placing values from before and after 1950 they become incomparable.
The summary for df_merge:
summary(df_merge)
Entity avg.ratio avg.Population avg.co2 Life.Ladder
Length:138 Min. : 0.03123 Min. :2.900e+05 Min. :2.539e+05 Min. :2.662
Class :character 1st Qu.: 0.67361 1st Qu.:4.486e+06 1st Qu.:4.375e+06 1st Qu.:4.593
Mode :character Median : 3.04396 Median :9.762e+06 Median :2.017e+07 Median :5.587
Mean : 4.87430 Mean :4.360e+07 Mean :1.902e+08 Mean :5.463
3rd Qu.: 7.70521 3rd Qu.:2.873e+07 3rd Qu.:9.600e+07 3rd Qu.:6.262
Max. :25.67181 Max. :1.296e+09 Max. :5.558e+09 Max. :7.788
sum.dif
Min. :-675355128
1st Qu.: 142380
Median : 6399728
Mean : 102976353
3rd Qu.: 28576960
Max. :7786512016
And for df_all:
summary(df_all)
Entity Code Year co2 Diff
Length:17854 Length:17854 Min. :1950 Min. :-6.255e+08 Min. :-600971748
Class :character Class :character 1st Qu.:1967 1st Qu.: 4.746e+05 1st Qu.: -9810
Mode :character Mode :character Median :1984 Median : 4.430e+06 Median : 43968
Mean :1984 Mean : 2.423e+08 Mean : 5310082
3rd Qu.:2002 3rd Qu.: 4.593e+07 3rd Qu.: 1018269
Max. :2019 Max. : 3.615e+10 Max. :1543508239
NA's :3595 NA's :3595
Population ratio
Min. :1.000e+03 Min. : 0.000
1st Qu.:2.400e+05 1st Qu.: 0.408
Median :3.386e+06 Median : 1.841
Mean :6.108e+07 Mean : 5.211
3rd Qu.:1.230e+07 3rd Qu.: 6.384
Max. :7.713e+09 Max. :403.351
NA's :914 NA's :4509
In this plot the different countries in the data are shown in deeper colors for higher values of emission to population ratio in the year of 2017
d = df_all %>%
filter(Year==2017)
l <- list(color = toRGB("grey"), width = 0.2)
# specify map projection/options
g <- list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = 'Mercator')
)
p <- plot_geo(d) %>%
add_trace(
z = ~ratio, color = ~ratio, colors = 'Reds',
text = ~Entity, locations = ~Code, marker = list(line = l)
) %>%
colorbar(title = 'CO2/Population', thickness=15) %>%
layout(
title = 'World Ratio of CO2/Population',
geo = g,
autosize = F
)
p <- ggplotly(p, width = 3000, height = 1500, automargin = TRUE)
p
As we can see, some of the leading countries in CO2/Population ratio are Qatar, United Arab Emirates, Kuwait, USA, Saudi Arabia, Canada, Kazakhstan and Australia with above 15 units of CO2/population. That means countries like China and India - whose CO2 emission rate is very high compared to other countries - are ranked lower, even though they are more polluting in net amount.
For comfortably displaying the data, each country is allocated to it’s fitting continent and a color.
fig <- d %>%
plot_ly(
x = ~co2,
y = ~Population,
size = ~ratio,
frame = ~Year,
text = ~Entity,
color = ~continent,
type = 'scatter',
mode = 'markers',
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "Yearly CO2 Emissions by Countries vs Population",
xaxis = list(
type = "log"
# autosize = F
)
)
fig
We would like to develop this graph forward and see it with correlation to the life ladder index from 2005-2017. For this purpose we would like to create another df, containing both happiness index scores and data from “df_all”, which contains continents data as well now:
We get the following plot:
fig <- plot_ly(
data = df_2005_2017,
x = ~co2,
y = ~Life.Ladder,
size = ~ratio,
frame = ~Year,
text = ~Entity,
color = ~continent,
type = 'scatter',
mode = 'markers',
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "Yearly CO2 Emissions by Countries vs Life Ladder",
xaxis = list(
type = "log"
)
)
fig
And yet another scatter plot for different countries:
df_3d <- df_2005_2017 %>%
filter(Year == 2017)
fig <- df_3d %>%
plot_ly(
x = ~Population,
y = ~ratio,
z = ~Life.Ladder,
# size = ~Life.Ladder,
# frame = ~Entity,
text = ~Entity,
color = ~continent,
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "CO2 Emissions vs Life Ladder, Filtered by Countries, 2017"
)
fig <- fig %>% add_markers()
fig
This graph shows the CO2 emissions of each country compared to it’s population and Life Ladder score. If we look at the United States for example, it seems that while the population has increased, the level of pollution has decreased.
# d <- df_all[ , c("Entity", "Year", "Population")]
d <- df_all %>%
filter(Entity != "World") %>%
filter(co2 != 0)
d
pp <- streamgraph(d, key="Entity",
order = "asis",
value="co2",
date="Year",
offset="zero",
sort="co2"
) %>%
sg_axis_y(tick_format = "e") %>%
sg_legend(show=TRUE, label="Country: ") %>%
sg_title("CO2 Emissions by Years and Countries, 1950-2017")
pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
This plot shows the accumulation of emissions year after year, from 1950 to 2017 for different countries.
This graph shows the accumulative happiness index for each country from 2005 to the present.
pp <- streamgraph(df_2005, key = "Entity",
# order = "reverse",
value="Life.Ladder",
date="year"
# offset="zero",
) %>%
sg_fill_brewer("Blues") %>%
sg_legend(show=TRUE, label="Country: ") %>%
sg_title("Happiness Index by Years and Countries, 1950-2017")
pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
top_emiss_2017 <- df_emiss %>%
filter(Year == 2017) %>%
filter(Code != '') %>%
filter(Entity != "World") %>%
top_n( 10, co2) %>%
arrange(desc(co2)) %>%
head(10)
top_countries = top_emiss_2017$Entity
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_countries, ]
, aes(x = Year, y = co2, group = Entity)) +
geom_line() +
labs(title = "Top 10 Countries by CO2 Emissions") +
geom_line(aes(col = Entity)) +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
g <- ggplot(data = df_all[df_all$Entity %in% top_countries, ]
, aes(x = Year, y = ratio, group = Entity)) +
geom_line(aes(col = Entity)) +
labs(title = "Top 10 Countries by CO2 Emissions to Population ratio") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
NA
According to these two graphs, using the CO2/Population ratio. For example, Saudi Arabia has a lower population than china, but in terms of pollution per person in 2017 Saudi Arabia had the highest emission per person ratio in the world.
d <- df_all[df_all$Entity %in% top_countries, ]
fig <- plot_ly(d,
x = ~Entity,
y = ~Diff,
type = "box",
color = ~Year,
height = 500,
width = 900,
automargin = TRUE)
fig <- fig %>% layout(
title = "Difference in Emission for Top 10 Polluting Countries",
boxmode = "group")
fig
NA
This graph takes the 10 happiest countries and shows the distribution of CO2 emissions difference level of emission between 1950 and 2017.
top_10_happiness <- df_merge %>%
filter(rank(desc(Life.Ladder))<=10)
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_10_happiness$Entity, ]
, aes(x = Year, y = co2, group = Entity)) +
geom_line() +
labs(title = "Top 10 happiness Countries by CO2 Emissions") +
theme_ipsum() +
geom_line(aes(col = Entity))
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
In this section we are going to suggest two types of correlation based on our exploration of the data:
g <- ggplot(data = df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
geom_point(aes(colour = Entity)) +
labs(title = "Average Ratio vs Happiness Index") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
In the following plot we have removed china and India because they are sort of “outliers” in a sense that they have displayed much larger grow in emissions compared to other countries, therefore making it very hard to display on the same plot.
df <- subset(df_merge, Entity != "China" & Entity != "India" )
g <- ggplot(data = df, aes(x = Life.Ladder , y = sum.dif)) +
geom_point(aes(colour = Entity)) +
labs(title = "Sum of Emissions Difference vs 2017 Happiness Index (China, India ex.)") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
# Part 4: Modeling
We are now going to introduce the linear model:
linearmod <- lm(Life.Ladder ~ avg.ratio, data=df_merge)
summary(linearmod)
Call:
lm(formula = Life.Ladder ~ avg.ratio, data = df_merge)
Residuals:
Min 1Q Median 3Q Max
-2.2447 -0.6262 -0.0959 0.6765 2.1682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.88716 0.10612 46.055 < 2e-16 ***
avg.ratio 0.11818 0.01416 8.346 7.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9468 on 136 degrees of freedom
Multiple R-squared: 0.3387, Adjusted R-squared: 0.3338
F-statistic: 69.65 on 1 and 136 DF, p-value: 7.093e-14
confint(linearmod, level=0.95)
2.5 % 97.5 %
(Intercept) 4.67731147 5.0970157
avg.ratio 0.09017982 0.1461869
p <- plot_ly(
x = fitted(linearmod),
y = residuals(linearmod),
height = 500,
width = 900,
automargin = TRUE)
p<- p %>% layout(
title = "Residual plot for the Linear Model",
autosize = T,
yaxis = list(title = 'Residuals'),
xaxis = list(title = 'Fitted Linear Model')
)
p
The residual plot shows no signs of
g <- ggplot(df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
geom_point(aes(colour = Entity)) +
theme_bw() +
stat_smooth(method = "lm") +
labs(title = "Linear Modelling Happiness Index and Average Ratio")
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
`geom_smooth()` using formula 'y ~ x'
g
NA
NA
According to the model, it can be seen that the correlation between the happiness index and the CO2 emission is not high, but it does exist. We will discuss it in the conclusions section.
numeric_tidy <- df_merge[-1]
corr_data <- cor(numeric_tidy)
g <- ggcorrplot(corr_data, hc.order = TRUE, type = "lower",
outline.col = "white") +
labs(title = "Correlation Matrix") +
theme_ipsum()
# colors = c("darkolivegreen2", "white", "darkolivegreen"))
ax <- list(
title = "",
showgrid = FALSE)
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE) %>%
layout(xaxis=ax, yaxis=ax)
g
NA
NA
We can see in the matrix above that there are some correlation that are easy to explain, such as a pretty strong one between the sum of differences regarding emissions over the test period, to the average size of population. There is also a weaker but existing correlation between the life ladder scale and the average ratio of CO2/Population size. The following section presents another model and it’s summary:
gmod <- lm(Life.Ladder ~ log(avg.ratio), data=df_merge)
summary(gmod)
Call:
lm(formula = Life.Ladder ~ log(avg.ratio), data = df_merge)
Residuals:
Min 1Q Median 3Q Max
-2.03388 -0.55089 0.04812 0.59537 1.89653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.15234 0.07550 68.25 <2e-16 ***
log(avg.ratio) 0.48724 0.04229 11.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8283 on 136 degrees of freedom
Multiple R-squared: 0.4939, Adjusted R-squared: 0.4902
F-statistic: 132.7 on 1 and 136 DF, p-value: < 2.2e-16
#plot x
p <- qplot(x = avg.ratio, Life.Ladder, data=df_merge)
p <- p + geom_smooth(method = "glm", formula = y~log(x), family = binomial(link = 'log')) +
geom_point(aes(colour=Entity)) +
theme_ipsum() +
labs(title = "Linear Regression Model - Average Ratio to Life Ladder Score")
p <- ggplotly(p, height = 500,
width = 900,
automargin = TRUE)
p
On this model, We can see a weak correlation, but an existing one to the logistic regression. Interestingly enough, The “happiest” countries seem to have a higher ratio of CO2 emissions as well. Also worth mentioning from this plot: * The US is located far right on this map, meaning it’s emitting a lot of CO2 in proportion to it’s population, but it is pretty high on the happiness index as well. * Kuwait, being the leader of this unfortunate characteristic, also has a pretty decent Life Ladder score.
To further check existing correlation, we would compare the differences between the 30 top rated countries in the happiness index and their avg.ratio of CO2/Population. In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it is possible to use an independent t-test to evaluate how their means differ.
Our research questions:
Is the mean avg.ratio of top 30 happiness (\(mA\)) is greater than the mean of other countries (\(mB\))?
Is the weighted average of the 30 happiness countries (mA) greater than the weighted average of the other countries (\(mB\))?
H0: $mA ≥ mB $ - The null hypothesis
Ha: $mA < mB $ (less) - The alternative hypothesis
top_30_happiness <- df_merge %>%
filter(rank(desc(Life.Ladder))<=30)
# Create a data frame
T_data <- df_merge %>%
select(Entity, avg.ratio,) %>%
mutate(group = ifelse(Entity %in% top_30_happiness$Entity, "Top 30 Contries", "Other Countries"))
## TODO Yarden - Need to add to T_data a column of Life.Ladder
## TODO make a graph that shows the difference of Life Ladder and avg.ratio by groups "top_10" and "others".
g <- ggboxplot(T_data, x = "group", y = "avg.ratio",
color = "group", palette = c("#00AFBB", "#E7B800"),
ylab = "avg. ratio", xlab = "Group") +
theme_ipsum() +
labs(title = "Average Ratio Compared - Top 30 Countries vs Other Countries")
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
NA
NA
Although not dramatic, it is obvious that some correlation does exist. We would like to further check:
#Do the two group have the same variances?
#We’ll use F-test to test for homogeneity in variances.
res.ftest <- var.test(avg.ratio ~ group, data = T_data)
res.ftest
F test to compare two variances
data: avg.ratio by group
F = 0.64534, num df = 107, denom df = 29, p-value = 0.112
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3405023 1.1067142
sample estimates:
ratio of variances
0.6453379
The p-value of F-test is p = 0.112 which is greater than the significance level 0.05. In conclusion, there is no significant difference between the variances of the two data sets. Therefore, we used the T-test and assumed that the variances are equal - according to case 2 of hypothesis testing.
# Compute t-test
res <- t.test(avg.ratio ~ group, data = T_data, var.equal = TRUE, alternative = "less")
res
Two Sample t-test
data: avg.ratio by group
t = -5.2316, df = 136, p-value = 3.103e-07
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -3.860236
sample estimates:
mean in group Other Countries mean in group Top 30 Contries
3.646415 9.294677
The p-value of the test is \(3.103e-07\), which is less than the significance level \(\alpha = 0.05\). We can conclude that top 30 average avg.ratio is significantly different from the other countries average avg.ratio with a \(pvalue = 3.103e-07\).
Generally, regarding our base assumptions, we were surprised to see that they were actually not only wrong, but almost opposite. while very undeveloped countries, like countries in Africa mostly, have a very low pollution index and life ladder index. On the other hand, some very developed countries have higher values on both indexes. There are also a lot of countries who don’t fit those assumptions at all.
Although, our final tests did show some weak correlation within the top 30 countries as shown.
we stipulate that these weak correlations are caused partially because of the fact that the happiness index is a survey based data. it means that it is possible that the citizens of a country are not necessarily aware of the CO2 pollution and it’s consequences. There are also other factors that affect the happiness index like welfare , corruption, health services, wealth etc.
On the other hand: 1. It can be assumed that a country that emits more CO2 is a more modern and technologically advanced country, therefore it is more established, its citizens feel a sense of progress, and that causes the happiness index to rise. 2. In more polluting countries - it is a possible that there are more workplaces, therefore these countries will have a lower unemployment rate - which can in turn increase the level of happiness index within the country.
In conclusion, in our opinion, the people may be aware of their country CO2 emissions, yet it doesn’t necessarily show in the happiness index survey.
Another thing to consider is that environmental preservation, pollution reduction and global warming are quite trending in the last few years, but this is a very complex system of dependencies, and the rate of which changes are noticeable is rapidly changing. It is unknown whether we have crossed the non-return point for making the planet to hot to live at due to those emissions, but time will tell.